(57) Abstract

The present invention is embodied in an apparatus, and
related method, for sensing a person's facial movements,
features and characteristics and the like to generate and
animate an avatar image based on the facial sensing. The
avatar apparatus uses an image processing technique based on
model graphs and bunch graphs that efficiently represent
image features as jets. The jets are composed of wavelet
transforms processed at node or landmark locations on an
image corresponding to readily identifiable features. The
nodes are acquired and tracked to animate an avatar image in
accordance with the person's facial movements. Also, the
facial sensing may use jet similarity to determine the
person's facial features and characteristics, thus allowing
tracking of a person's natural characteristics without any
unnatural elements that may interfere with or inhibit the
person's natural characteristics.

WAVELET-BASED FACIAL MOTION
CAPTURE FOR AVATAR ANIMATION

Field of the Invention 

The present invention relates to dynamic facial
feature sensing, and more particularly, to a vision-based
motion capture system that allows real-time
finding, tracking and classification of facial features
for input into a graphics engine that animates an
avatar.



Background of the Invention

Virtual spaces filled with avatars are an
attractive way to allow for the experience of a shared
environment. However, existing shared environments
generally lack facial feature sensing of sufficient
quality to allow for the incarnation of a user, i.e.,
the endowment of an avatar with the likeness,
expressions or gestures of the user. Quality facial
feature sensing is a significant advantage because
facial gestures are a primordial means of
communication. Thus, the incarnation of a user
augments the attractiveness of virtual spaces.

Existing methods of facial feature sensing
typically use markers that are glued to a person's
face. The use of markers for facial motion capture is
cumbersome and has generally restricted the use of
facial motion capture to high-cost applications such as
movie production.

Accordingly, there exists a significant need for a
vision-based motion capture system that implements
convenient and efficient facial feature sensing. The
present invention satisfies this need.

Summary of the Invention

The present invention is embodied in an apparatus,
and related method, for sensing a person's facial
movements, features or characteristics. The results of
the facial sensing may be used to animate an avatar
image. The avatar apparatus uses an image processing
technique based on model graphs and bunch graphs that
efficiently represent image features as jets composed
of wavelet transforms at landmarks on a facial image
corresponding to readily identifiable features. The
sensing system allows tracking of a person's natural
characteristics without any unnatural elements to
interfere with the person's natural characteristics.

The feature sensing process operates on a sequence
of image frames, transforming each image frame using a
wavelet transformation to generate a transformed image
frame. Node locations associated with wavelet jets of
a model graph are initialized to the transformed image
frame by moving the model graph across the
transformed image frame and placing the model graph at
a location in the transformed image frame of maximum
jet similarity between the wavelet jets at the node
locations and the transformed image frame. The
location of one or more node locations of the model
graph is tracked between image frames. A tracked node
is reinitialized if the node's position deviates beyond
a predetermined position constraint between image
frames.

In one embodiment of the invention, the facial
feature finding may be based on elastic bunch graph
matching for individualizing a head model. Also, the
model graph for facial image analysis may include a
plurality of location nodes (e.g., 18) associated with
distinguishing features on a human face.

Other features and advantages of the present
invention should be apparent from the following
description of the preferred embodiments, taken in
conjunction with the accompanying drawings, which
illustrate, by way of example, the principles of the
invention.

Brief Description of the Drawings

FIG. 1 is a block diagram of an avatar animation
system and process, according to the invention.

FIG. 2 is a block diagram of a facial feature
sensing apparatus and process, according to the
invention, for the avatar animation system and process
of FIG. 1.

FIG. 3 is a block diagram of a video image
processor for implementing the facial feature sensing
apparatus of FIG. 2.

FIG. 4 is a flow diagram, with accompanying
photographs, for illustrating a landmark finding
technique of the facial feature sensing apparatus and
system of FIG. 2.

FIG. 5 is a series of images showing processing of
a facial image using Gabor wavelets, according to the
invention.

FIG. 6 is a series of graphs showing the
construction of a jet, image graph, and bunch graph
using the wavelet processing technique of FIG. 5,
according to the invention.

FIG. 7 is a diagram of a model graph, according to
the invention, for processing facial images.

FIG. 8 includes two diagrams showing the use of
wavelet processing to locate facial features.

FIG. 9 is a flow diagram showing a tracking
technique for tracking landmarks found by the landmark
finding technique of FIG. 4.

FIG. 10 is a diagram of a Gaussian image pyramid
technique for illustrating landmark tracking in one
dimension.

FIG. 11 is a series of two facial images, with
accompanying graphs of pose angle versus frame number,
showing tracking of facial features over a sequence of
50 image frames.

FIG. 12 is a flow diagram, with accompanying
photographs, for illustrating a pose estimation
technique of the facial feature sensing apparatus and
system of FIG. 2.

FIG. 13 is a schematic diagram of a face with
extracted eye and mouth regions, for illustrating a
coarse-to-fine landmark finding technique.

FIG. 14 are photographs showing the extraction of
profile and facial features using the elastic bunch
graph technique of FIG. 6.

FIG. 15 is a flow diagram showing the generation
of a tagged personalized bunch graph along with a
corresponding gallery of image patches that encompasses
a variety of a person's expressions for avatar
animation, according to the invention.

FIG. 16 is a flow diagram showing a technique for
animating an avatar using image patches that are
transmitted to a remote site, and that are selected at
the remote site based on transmitted tags based on
facial sensing associated with a person's current
facial expressions.

FIG. 17 is a flow diagram showing rendering of a
three-dimensional head image generated, based on facial
feature positions and tags, using volume morphing
integrated with dynamic texture generation.

FIG. 18 is a block diagram of an avatar animation
system, according to the invention, that includes audio
analysis for animating an avatar.

Detailed Description of the Preferred Embodiments

The present invention is embodied in an apparatus,
and related method, for sensing a person's facial
movements, features and characteristics and the like to
generate and animate an avatar image based on the
facial sensing. The avatar apparatus uses an image
processing technique based on model graphs and bunch
graphs that efficiently represent image features as
jets. The jets are composed of wavelet transforms
processed at node or landmark locations on an image
corresponding to readily identifiable features. The
nodes are acquired and tracked to animate an avatar
image in accordance with the person's facial movements.
Also, the facial sensing may use jet similarity to
determine the person's facial features and
characteristics, thus allowing tracking of a person's
natural characteristics without any unnatural elements
that may interfere with the person's natural
characteristics.

As shown in FIG. 1, the avatar animation system 10
of the invention includes an imaging system 12, a
facial sensing process 14, a data communication network
16, a facial animation process 18, and an avatar
display 20. The imaging system acquires and digitizes
a live video image signal of a person, thus generating a
stream of digitized video data organized into image
frames. The digitized video image data is provided to
the facial sensing process, which locates the person's
face and corresponding facial features in each frame.
The facial sensing process also tracks the positions
and characteristics of the facial features from
frame to frame. The tracking information may be transmitted
via the network to one or more remote sites, which
receive the information and generate, using a
graphics engine, an animated facial image on the avatar
display. The animated facial image may be based on a
photorealistic model of the person, a cartoon character
or a face completely unrelated to the user.

The imaging system 12 and the facial sensing
process 14 are shown in more detail in FIGS. 2 and 3.
The imaging system captures the person's image using a
digital video camera 22 which generates a stream of
video image frames. The video image frames are
transferred into a video random-access memory (VRAM) 24
for processing. A satisfactory imaging system is the
Matrox Meteor II available from Matrox™, which
digitizes images produced by a conventional CCD camera
and transfers the images in real time into the memory
at a frame rate of 30 Hz. The image frame is processed
by an image processor 26 having a central processing
unit (CPU) 28 coupled to the VRAM and random-access
memory (RAM) 30. The RAM stores program code and data
for implementing the facial sensing and avatar
animation processes.

The facial feature process operates on the
digitized images to find the person's facial features
(block 32), track the features (block 34), and
reinitialize feature tracking as needed. The facial
features also may be classified (block 36). The facial
feature process generates data associated with the
position and classification of the facial features, which
is provided to an interface with the facial animation
process (block 38).

The facial features may be located using the elastic
graph matching shown in FIG. 4. In the elastic graph
matching technique, a captured image (block 40) is
transformed into Gabor space using a wavelet
transformation (block 42), which is described below in
more detail with respect to FIG. 5. The transformed
image (block 44) is represented by 40 complex values,
representing wavelet components, for each pixel of the
original image. Next, a rigid copy of a model graph,
which is described in more detail below with respect to
FIG. 7, is positioned over the transformed image at
varying model node positions to locate a position of
optimum similarity (block 46). The search for the
optimum similarity may be performed by positioning the
model graph in the upper left-hand corner of the image,
extracting the jets at the nodes, and determining the
similarity between the image graph and the model graph.
The search continues by sliding the model graph left to
right starting from the upper-left corner of the image
(block 48). When a rough position of the face is found
(block 50), the nodes are individually allowed to move,
introducing elastic graph distortions (block 52). A
phase-insensitive similarity function is used in order
to locate a good match (block 54). A phase-sensitive
similarity function is then used to locate a jet with
accuracy because the phase is very sensitive to small
jet displacements. The phase-insensitive and the
phase-sensitive similarity functions are described
below with respect to FIGS. 5-8. Note that although
the graphs are shown in FIG. 4 with respect to the
original image, the model graph movements and matching
are actually performed on the transformed image.
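The rigid scan of blocks 46-48 can be illustrated with a minimal sketch. It assumes the transformed image is already available as an array holding one 40-component jet per pixel, and it scores each placement with the phase-insensitive (amplitude) similarity. The names `rigid_scan`, `amplitude_similarity` and the `step` parameter are hypothetical and chosen for illustration only; the elastic node moves of block 52 are not shown.

```python
import numpy as np

def amplitude_similarity(j1, j2):
    """Normalized dot product of jet magnitudes (phase-insensitive)."""
    a1, a2 = np.abs(j1), np.abs(j2)
    return float(a1 @ a2 / np.sqrt((a1 @ a1) * (a2 @ a2) + 1e-12))

def rigid_scan(jet_image, model_offsets, model_jets, step=4):
    """Slide a rigid model graph over the image; return best origin and score.

    jet_image:     (H, W, 40) complex array, one jet per pixel (assumed precomputed)
    model_offsets: (N, 2) non-negative integer (row, col) node offsets from the origin
    model_jets:    (N, 40) complex array, one jet per model node
    step:          scan stride in pixels (illustrative assumption)
    """
    h, w, _ = jet_image.shape
    max_r, max_c = model_offsets.max(axis=0)
    best_pos, best_score = None, -np.inf
    for r in range(0, h - max_r, step):        # top to bottom
        for c in range(0, w - max_c, step):    # left to right
            # Graph similarity: sum of similarities of corresponding jets
            score = sum(
                amplitude_similarity(jet_image[r + dr, c + dc], mj)
                for (dr, dc), mj in zip(model_offsets, model_jets)
            )
            if score > best_score:
                best_pos, best_score = (r, c), score
    return best_pos, best_score
```

The returned origin is only a rough face position; in the technique described above it would serve as the starting point for the individual node moves.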

The wavelet transform is described with reference
to FIG. 5. An original image is processed using a
Gabor wavelet to generate a convolution result. The
Gabor-based wavelet consists of a two-dimensional
complex wave field modulated by a Gaussian envelope.

$$\psi_{\vec{k}}(\vec{x}) = \frac{k^2}{\sigma^2} \exp\left(-\frac{k^2 x^2}{2\sigma^2}\right) \left\{ \exp\left(i\,\vec{k}\cdot\vec{x}\right) - \exp\left(-\frac{\sigma^2}{2}\right) \right\} \qquad (1)$$

The wavelet is a plane wave with wave vector k,
restricted by a Gaussian window, the size of which
relative to the wavelength is parameterized by σ. The
term in the brace removes the DC component. The
amplitude of the wave vector k may be chosen as follows,
where ν is related to the desired spatial resolutions.

$$k_\nu = 2^{-\frac{\nu+2}{2}}\,\pi, \qquad \nu = 1, 2, \ldots \qquad (2)$$

A wavelet, centered at image position x, is used to
extract the wavelet component J_k from the image with
gray level distribution I(x),

$$J_{\vec{k}}(\vec{x}) = \int d\vec{x}'\, I(\vec{x}')\, \psi_{\vec{k}}(\vec{x} - \vec{x}') \qquad (3)$$
The space of wave vectors k is typically sampled
in a discrete hierarchy of 5 resolution levels
(differing by half-octaves) and 8 orientations at each
resolution level (see, e.g., FIG. 8), thus generating
40 complex values for each sampled image point (the
real and imaginary components referring to the cosine
and sine phases of the plane wave). The samples in
k-space are designated by the index j = 1,..,40 and all
wavelet components centered in a single image point are
considered as a vector which is called a jet 60, shown
in FIG. 6. Each jet describes the local features of
the area surrounding x. If sampled with sufficient
density, the image may be reconstructed from jets
within the bandpass covered by the sampled frequencies.
Thus, each component of a jet is the filter response of
a Gabor wavelet extracted at a point (x, y) of the
image.
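As an illustration of how a 40-component jet could be computed at one image point, the following sketch builds a bank of DC-free Gabor kernels (5 half-octave levels times 8 orientations) and evaluates the convolution at a single pixel. Several details are assumptions made for the example and are not stated in the text: the window parameter σ = 2π, the orientation spacing of π/8, the negative-exponent form of the wave-vector magnitude, the 65-pixel kernel size, and all function names.

```python
import numpy as np

def gabor_kernel(k_vec, sigma=2 * np.pi, size=65):
    """DC-free Gabor wavelet: plane wave with wave vector k under a Gaussian window."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    k2 = k_vec[0] ** 2 + k_vec[1] ** 2
    envelope = (k2 / sigma**2) * np.exp(-k2 * (x**2 + y**2) / (2 * sigma**2))
    # The subtracted constant removes the DC component
    wave = np.exp(1j * (k_vec[0] * x + k_vec[1] * y)) - np.exp(-sigma**2 / 2)
    return envelope * wave

def wave_vectors(levels=5, orientations=8):
    """Return (levels * orientations, 2) wave vectors as (kx, ky) rows."""
    ks = []
    for nu in range(1, levels + 1):
        k = 2 ** (-(nu + 2) / 2) * np.pi           # assumed magnitude, half-octave steps
        for mu in range(orientations):
            phi = mu * np.pi / orientations        # assumed orientation spacing
            ks.append((k * np.cos(phi), k * np.sin(phi)))
    return np.array(ks)

def extract_jet(image, row, col, ks, size=65):
    """Jet at (row, col); the point must lie at least size // 2 pixels from each border."""
    half = size // 2
    patch = image[row - half:row + half + 1, col - half:col + half + 1]
    # Convolution at one point is sum of I(x') * psi(x - x'), hence the flipped kernel
    return np.array(
        [np.sum(patch * gabor_kernel(k, size=size)[::-1, ::-1]) for k in ks]
    )
```

A practical system would more likely convolve the whole frame once per kernel (for example in the frequency domain) rather than recomputing kernels per point; the per-point form is used here only to keep the example short.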

A labeled image graph 62, as shown in FIG. 6, is
used to describe the aspect of an object (in this
context, a face). The nodes 64 of the labeled graph
refer to points on the object and are labeled by jets
60. Edges 66 of the graph are labeled with distance
vectors between the nodes. Nodes and edges define the
graph topology. Graphs with equal topology can be
compared. The normalized dot product of the absolute
components of two jets defines the jet similarity. This
value is independent of contrast changes. To compute
the similarity between two graphs, the sum is taken
over similarities of corresponding jets between the
graphs.

A model graph 68 that is particularly designed for
finding a human face in an image is shown in FIG. 7.
The numbered nodes of the graph have the following
locations:

0   right eye pupil
1   left eye pupil
2   top of the nose
3   right corner of the right eyebrow
4   left corner of the right eyebrow
5   right corner of the left eyebrow
6   left corner of the left eyebrow
7   right nostril
8   tip of the nose
9   left nostril
10  right corner of the mouth
11  center of the upper lip
12  left corner of the mouth
13  center of the lower lip
14  bottom of the right ear
15  top of the right ear
16  top of the left ear
17  bottom of the left ear

To represent a face, a data structure called bunch
graph 70 (FIG. 6) is used. It is similar to the graph
described above, but instead of attaching only a single
jet to each node, a whole bunch of jets 72 (a bunch
jet) are attached to each node. Each jet is derived
from a different facial image. To form a bunch graph,
a collection of facial images (the bunch graph gallery)
is marked with node locations at defined positions of
the head. These defined positions are called
landmarks. When matching a bunch graph to an image,
the jet extracted from the image is compared to all
jets in the corresponding bunch attached to the bunch
graph and the best-matching one is selected. This
matching process is called elastic bunch graph
matching. When constructed using a judiciously
selected gallery, a bunch graph covers a great variety
of faces that may have significantly different local
properties, e.g., samples of male and female faces, and
of persons of different ages or races.
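The per-node selection step of bunch graph matching can be sketched as follows. The example assumes the bunch for one landmark is stored as an array with one jet per gallery face and uses the amplitude-based jet similarity; the function name `best_in_bunch` and the array layout are illustrative assumptions.

```python
import numpy as np

def best_in_bunch(image_jet, bunch):
    """Return (index, similarity) of the bunch jet most similar to image_jet.

    image_jet: (40,) complex jet extracted from the image at the node position
    bunch:     (M, 40) complex array, one jet per gallery face for this landmark
    """
    a = np.abs(image_jet)
    b = np.abs(bunch)
    # Normalized dot product of magnitudes against every jet in the bunch
    sims = (b @ a) / (np.linalg.norm(b, axis=1) * np.linalg.norm(a))
    idx = int(np.argmax(sims))
    return idx, float(sims[idx])
```

Because the similarity is normalized, scaling the image jet by a constant (a contrast change) leaves the selected index and score unchanged.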

Again, in order to find a face in an image frame,
the graph is moved and scaled and distorted until a
place is located at which the graph matches best (the
best fitting jets within the bunch jets are most
similar to jets extracted from the image at the current
positions of the nodes). Since face features differ
from face to face, the graph is made more general for
the task, e.g., each node is assigned with jets of the
corresponding landmark taken from 10 to 100 individual
faces.

Two different jet similarity functions for two
different, or even complementary, tasks are employed.
If the components of a jet J are written in the form
J_j = a_j exp(i φ_j), with amplitude a_j and phase φ_j,
one form for the similarity of two jets J and J' is the
normalized scalar product of the amplitude vector

$$S(J, J') = \frac{\sum_j a_j a'_j}{\sqrt{\sum_j a_j^2 \, \sum_j a_j'^2}} \qquad (4)$$

The other similarity function has the form

$$S(J, J') = \frac{\sum_j a_j a'_j \cos\left(\phi_j - \phi'_j - \vec{d}\cdot\vec{k}_j\right)}{\sqrt{\sum_j a_j^2 \, \sum_j a_j'^2}} \qquad (5)$$

This function includes a relative displacement vector d
between the image points to which the two jets refer.
When comparing two jets during graph matching, the
similarity between them is maximized with respect to d,
leading to an accurate determination of jet position.
Both similarity functions are used, with preference
often given to the phase-insensitive version (which
varies smoothly with relative position) when first
matching a graph, and to the phase-sensitive version
when accurately positioning the jet.
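The two similarity functions can be written compactly for jets stored as complex vectors. This is a minimal sketch: the function names are hypothetical, `ks` is assumed to hold one wave vector per jet component, and the sign convention (phase difference φ_j − φ'_j compared against d·k_j) is taken from the phase-sensitive form described above.

```python
import numpy as np

def similarity_amplitude(j1, j2):
    """Phase-insensitive similarity: normalized scalar product of amplitudes."""
    a1, a2 = np.abs(j1), np.abs(j2)
    return float(np.sum(a1 * a2) / np.sqrt(np.sum(a1**2) * np.sum(a2**2)))

def similarity_phase(j1, j2, d, ks):
    """Phase-sensitive similarity for a trial displacement d.

    ks: (40, 2) array of wave vectors, one row per jet component (assumed layout)
    d:  length-2 displacement in the same coordinate order as the rows of ks
    """
    a1, a2 = np.abs(j1), np.abs(j2)
    dphi = np.angle(j1) - np.angle(j2) - ks @ np.asarray(d, dtype=float)
    return float(
        np.sum(a1 * a2 * np.cos(dphi)) / np.sqrt(np.sum(a1**2) * np.sum(a2**2))
    )
```

The amplitude form changes slowly as a jet is moved, which suits the initial graph placement; the phase form peaks sharply at the correct displacement, which suits fine positioning.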

After the facial features are located, the facial
features may be tracked over consecutive frames as
illustrated in FIG. 9. The tracking technique of the
invention achieves robust tracking over long frame
sequences by using a tracking correction scheme that
detects whether tracking of a feature or node has been
lost and reinitializes the tracking process for that
node.

The position X_n of a single node in an image I_n
of an image sequence is known either by landmark
finding on image I_n using the landmark finding method
(block 80) described above, or by tracking the node
from image I_(n-1) to I_n using the tracking process.
The node is then tracked (block 82) to a corresponding
position X_(n+1) in the image I_(n+1) by one of several
techniques. The tracking methods described below
advantageously accommodate fast motion.

A first tracking technique involves linear motion
prediction. The search for the corresponding node
position X_(n+1) in the new image I_(n+1) is started at
a position generated by a motion estimator. A
disparity vector (X_n - X_(n-1)) is calculated that
represents the displacement, assuming constant
velocity, of the node between the preceding two frames.
The disparity or displacement vector D_n may be added
to the position X_n to predict the node position
X_(n+1). This linear motion model is particularly
advantageous for accommodating constant velocity
motion. The linear motion model also provides good
tracking if the frame rate is high compared to the
acceleration of the objects being tracked. However,
the linear motion model performs poorly if the frame
rate is too low compared to the acceleration of the
objects in the image sequence. Because it is difficult
for any motion model to track objects under such
conditions, use of a camera having a higher frame rate
is recommended.

The linear motion model may generate too large
an estimated motion vector D_n, which could lead to an
accumulation of the error in the motion estimation.
Accordingly, the linear prediction may be damped using
a damping factor f_D. The resulting estimated motion
vector is D_n = f_D * (X_n - X_(n-1)). A suitable
damping factor is 0.9. If no previous frame I_(n-1)
exists, e.g., for a frame immediately after landmark
finding, the estimated motion vector is set equal to
zero (D_n = 0).
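The damped prediction amounts to a few lines of arithmetic, sketched below. The function name `predict_position` and the use of `None` to signal a missing previous frame are illustrative choices, not part of the description.

```python
import numpy as np

def predict_position(x_n, x_prev=None, damping=0.9):
    """Damped linear motion prediction for one node.

    Returns (predicted X_(n+1), estimated motion vector D_n).
    x_prev is None when no previous frame exists, e.g. right after landmark finding.
    """
    x_n = np.asarray(x_n, dtype=float)
    if x_prev is None:
        d_n = np.zeros_like(x_n)                     # D_n = 0
    else:
        d_n = damping * (x_n - np.asarray(x_prev, dtype=float))
    return x_n + d_n, d_n
```

For example, a node that moved from (100, 50) to (110, 52) has an undamped displacement of (10, 2); with f_D = 0.9 the search in the next frame starts at (119, 53.8).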

A tracking technique based on a Gaussian image
pyramid, applied to one dimension, is illustrated in
FIG. 10. Instead of using the original image
resolution, the image is down sampled 2-4 times to
create a Gaussian pyramid of the image. An image
pyramid of 4 levels results in a distance of 24 pixels
on the finest, original resolution level being
represented as only 3 pixels on the coarsest level.
Jets may be computed and compared at any level of the
pyramid.
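A pyramid of this kind can be sketched with a blur-and-subsample step repeated once per additional level. The 5-tap binomial filter used here is a common approximation to a Gaussian and is an assumption of the example, as are the function names; the text does not specify the filter.

```python
import numpy as np

def downsample(image):
    """Blur with a 5-tap binomial (approximately Gaussian) filter, then halve."""
    kernel = np.array([1, 4, 6, 4, 1], dtype=float) / 16.0
    padded = np.pad(image, 2, mode="edge")
    # Separable filtering: along columns of each row first, then along rows
    rows = sum(w * padded[:, i:i + image.shape[1]] for i, w in enumerate(kernel))
    blurred = sum(w * rows[i:i + image.shape[0], :] for i, w in enumerate(kernel))
    return blurred[::2, ::2]

def gaussian_pyramid(image, levels=4):
    """Return a list of images; index 0 is the finest (original) level."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid
```

With 4 levels the image is halved three times, so coordinates on the coarsest level are divided by 2^3 = 8, which is how a 24-pixel distance becomes 3 pixels.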

Tracking of a node on the Gaussian image pyramid
is generally performed first at the most coarse level
and then proceeding to the most fine level. A jet is
extracted on the coarsest Gauss level of the actual
image frame I_(n+1) at the position X_(n+1) using the
damped linear motion estimation X_(n+1) = (X_n + D_n)
as described above, and compared to the corresponding
jet computed on the coarsest Gauss level of the
previous image frame. From these two jets, the
disparity is determined, i.e., the 2D vector R pointing
from X_(n+1) to that position that corresponds best to
the jet from the previous frame. This new position is
assigned to X_(n+1). The disparity calculation is
described below in more detail. The position on the
next finer Gauss level of the actual image (being
2*X_(n+1)), corresponding to the position X_(n+1) on
the coarsest Gauss level, is the starting point for the
disparity computation on this next finer level. The
jet extracted at this point is compared to the
corresponding jet calculated on the same Gauss level of
the previous image frame. This process is repeated for
all Gauss levels until the finest resolution level is
reached, or until the Gauss level is reached which is
specified for determining the position of the node
corresponding to the previous frame's position.

Two representative levels of the Gaussian image
pyramid are shown in FIG. 10, a coarser level 94 above,
and a finer level 96 below. Each jet is assumed to have
filter responses for two frequency levels. Starting at
position 1 on the coarser Gauss level, X_(n+1) = X_n + D_n,
a first disparity move using only the lowest frequency
jet coefficients leads to position 2. A second
disparity move by using all jet coefficients of both
frequency levels leads to position 3, the final
position on this Gauss level. Position 1 on the finer
Gauss level corresponds to position 3 on the coarser
level with the coordinates being doubled. The
disparity move sequence is repeated, and position 3 on
the finest Gauss level is the final position of the
tracked landmark. For more accurate tracking, the
number of Gauss and frequency levels may be increased.

After the new position of the tracked node in the
actual image frame has been determined, the jets on all
Gauss levels are computed at this position. A stored
array of jets that was computed for the previous frame,
representing the tracked node, is then replaced by a
new array of jets computed for the current frame.

Use of the Gauss image pyramid has two main
advantages: First, movements of nodes are much smaller
in terms of pixels on a coarser level than in the
original image, which makes tracking possible by
performing only a local move instead of an exhaustive
search in a large image region. Second, the computation
of jet components is much faster for lower frequencies,
because the computation is performed with a small
kernel window on a down sampled image, rather than on a
large kernel window on the original resolution image.

Note that the correspondence level may be chosen
dynamically, e.g., in the case of tracking facial
features, the correspondence level may be chosen dependent
on the actual size of the face. Also, the size of the
Gauss image pyramid may be altered through the tracking
process, i.e., the size may be increased when motion
gets faster, and decreased when motion gets slower.
Typically, the maximal node movement on the coarsest
Gauss level is limited to 4 pixels. Also note that the
motion estimation is often performed only on the
coarsest level.

The computation of the displacement vector between
two given jets on the same Gauss level (the disparity
vector) is now described. To compute the displacement
between two consecutive frames, a method is used which
was originally developed for disparity estimation in
stereo images, based on D. J. Fleet and A. D. Jepson,
"Computation of component image velocity from local
phase information," International Journal of Computer
Vision, volume 5, issue 1, pages 77-104, 1990, and W.
M. Theimer and H. A. Mallot, "Phase-based binocular
vergence control and depth reconstruction using active
vision," CVGIP: Image Understanding, volume 60, issue
3, pages 343-358, November 1994.

The strong variation of the phases of the complex
filter responses is used explicitly to compute the
displacement with subpixel accuracy (Wiskott,
"Labeled Graphs and Dynamic Link Matching for Face
Recognition and Scene Analysis", Verlag Harri Deutsch,
Thun-Frankfurt am Main, Reihe Physik 53 (PhD thesis,
1995)). By writing the response J_j to the jth Gabor
filter in terms of amplitude a_j and phase φ_j, a
similarity function can be defined as

$$S(J, J', \vec{d}) = \frac{\sum_j a_j a'_j \cos\left(\phi_j - \phi'_j - \vec{d}\cdot\vec{k}_j\right)}{\sqrt{\sum_j a_j^2 \, \sum_j a_j'^2}} \qquad (6)$$

Let J and J' be two jets at positions X and X' = X + d;
the displacement d may be found by maximizing the
similarity S with respect to d, the k_j being the
wave vectors associated with the filter generating J_j.
Because the estimation of d is only precise for small
displacements, i.e., large overlap of the Gabor jets,
large displacement vectors are treated as a first
estimate only, and the process is repeated in the
following manner. First, only the filter responses of
the lowest frequency level are used, resulting in a
first estimate d_1. Next, this estimate is executed
and the jet J is recomputed at the position X_1 = X + d_1,
which is closer to the position X' of jet J'. Then,
the lowest two frequency levels are used for the
estimation of the displacement d_2, and the jet J is
recomputed at the position X_2 = X_1 + d_2. This is
iterated until the highest frequency level used is
reached, and the final disparity d between the two
start jets J and J' is given as the sum d = d_1 + d_2 + ...

Accordingly, displacements of up to half the
wavelength of the kernel with the lowest frequency may
be computed (see Wiskott 1995, supra).
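One way to carry out the maximization is sketched below. It is not taken from the text: it assumes the cosine is replaced by its second-order expansion, which turns each step into a weighted least-squares problem for d, and it assumes that the effect of moving the jet by the current estimate can be approximated by subtracting d·k_j from the phase differences rather than recomputing the jet from the image. The convention used is that the phase difference φ_j − φ'_j is approximately d·k_j. All names are hypothetical.

```python
import numpy as np

def estimate_displacement(j1, j2, ks, level_ids, n_levels):
    """Coarse-to-fine displacement estimate from jet phase differences.

    j1, j2:    (40,) complex jets J and J'
    ks:        (40, 2) wave vectors, one row per jet component
    level_ids: (40,) frequency level of each component, 0 = lowest frequency
    """
    level_ids = np.asarray(level_ids)
    weights = np.abs(j1) * np.abs(j2)              # a_j * a'_j
    raw = np.angle(j1 * np.conj(j2))               # phi_j - phi'_j, wrapped to (-pi, pi]
    d = np.zeros(2)
    for top in range(n_levels):
        use = level_ids <= top                     # lowest (top + 1) frequency levels
        # Phase residual left after the current estimate, re-wrapped
        res = np.angle(np.exp(1j * (raw[use] - ks[use] @ d)))
        k = ks[use]
        wk = weights[use, None] * k
        # Normal equations of the weighted least-squares problem
        d = d + np.linalg.solve(wk.T @ k, wk.T @ res)
    return d
```

Starting with the lowest frequencies keeps the phase differences unwrapped for displacements up to half the longest wavelength; the higher levels then refine the estimate once the residual displacement is small.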

Although the displacements are determined
using floating point numbers, jets may be extracted
(i.e., computed by convolution) at (integer) pixel
positions only, resulting in a systematic rounding
error. To compensate for this subpixel error Δd, the
phases of the complex Gabor filter responses should be
shifted according to

$$\Delta\phi_j = \Delta\vec{d}\cdot\vec{k}_j \qquad (7)$$

so that the jets will appear as if they were extracted
at the correct subpixel position. Accordingly, the
Gabor jets may be tracked with subpixel accuracy
without any further accounting of rounding errors.
Note that Gabor jets provide a substantial advantage in
image processing because the problem of subpixel
accuracy is more difficult to address in most other
image processing methods.
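The compensation can be sketched as a split of the floating-point position into an integer pixel and a subpixel remainder, followed by a per-component phase shift. The function names are hypothetical, and the sign of the shift (adding Δd·k_j to each phase) is an assumption that depends on how the phases are defined.

```python
import numpy as np

def split_subpixel(position):
    """Split a floating-point position into the nearest pixel and the remainder."""
    position = np.asarray(position, dtype=float)
    pixel = np.rint(position).astype(int)
    return pixel, position - pixel

def shift_jet_phases(jet, ks, delta_d):
    """Shift each component's phase by delta_d . k_j, leaving amplitudes unchanged."""
    return jet * np.exp(1j * (ks @ np.asarray(delta_d, dtype=float)))
```

For a tracked position of (12.3, 40.8), the jet would be extracted at pixel (12, 41) and its phases shifted for the remainder (0.3, -0.2).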

Tracking error may be detected by determining
whether a confidence or similarity value is smaller
than a predetermined threshold (block 84 of FIG. 9).
The similarity (or confidence) value S may be
calculated to indicate how well the two image regions
in the two image frames correspond to each other
simultaneous with the calculation of the displacement
of a node between consecutive image frames. Typically,
the confidence value is close to 1, indicating good
correspondence. If the confidence value is not close to
1, either the corresponding point in the image has not
been found (e.g., because the frame rate was too low
compared to the velocity of the moving object), or this
image region has changed so drastically from one image
frame to the next that the correspondence is no longer
well defined (e.g., for the node tracking the pupil of
the eye, the eyelid has been closed). Nodes having a
confidence value below a certain threshold may be
switched off.

A tracking error also may be detected when certain
geometrical constraints are violated (block 86). If
many nodes are tracked simultaneously, the geometrical
configuration of the nodes may be checked for
consistency. Such geometrical constraints may be
fairly loose, e.g., when facial features are tracked,
the nose must be between the eyes and the mouth.
Alternatively, such geometrical constraints may be
rather accurate, e.g., a model containing the precise
shape information of the tracked face. For
intermediate accuracy, the constraints may be based on
a flat plane model. In the flat plane model, the nodes
of the face graph are assumed to be on a flat plane.
For image sequences that start with the frontal view,
the tracked node positions may be compared to the
corresponding node positions of the frontal graph
transformed by an affine transformation to the actual
frame. The 6 parameters of the optimal affine
transformation are found by minimizing the least
squares error in the node positions. Deviations
between the tracked node positions and the transformed
node positions are compared to a threshold. The nodes
having deviations larger than the threshold are
switched off. The parameters of the affine
transformation may be used to determine the pose and
relative scale (compared to the start graph)
simultaneously (block 88). Thus, this rough flat plane
model assures that tracking errors may not grow beyond
a predetermined threshold.
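The flat plane check can be sketched as an ordinary least-squares fit of the 6 affine parameters followed by a per-node deviation test. The function names, the array layout, and the choice of a Euclidean distance for the deviation are assumptions of the example.

```python
import numpy as np

def fit_affine(frontal, tracked):
    """Least-squares 6-parameter affine map from frontal to tracked node positions.

    Returns a (3, 2) matrix P such that [x, y, 1] @ P approximates the tracked point.
    """
    frontal = np.asarray(frontal, dtype=float)
    tracked = np.asarray(tracked, dtype=float)
    design = np.hstack([frontal, np.ones((len(frontal), 1))])    # (N, 3)
    params, *_ = np.linalg.lstsq(design, tracked, rcond=None)
    return params

def flag_outliers(frontal, tracked, threshold):
    """Return (boolean mask of nodes to switch off, per-node deviations)."""
    frontal = np.asarray(frontal, dtype=float)
    tracked = np.asarray(tracked, dtype=float)
    params = fit_affine(frontal, tracked)
    design = np.hstack([frontal, np.ones((len(frontal), 1))])
    deviations = np.linalg.norm(design @ params - tracked, axis=1)
    return deviations > threshold, deviations
```

Because the fit includes the mistracked nodes themselves, a large error pulls the transformation slightly toward it; the threshold should leave some margin for that bias.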

If a tracked node is switched off because of a
tracking error, the node may be reactivated at the
correct position (block 90), advantageously using bunch
graphs that include different poses, and tracking
continued from the corrected position (block 92).
After a tracked node has been switched off, the system
may wait until a predefined pose is reached for which a
pose specific bunch graph exists. Otherwise, if only a
frontal bunch graph is stored, the system must wait
until the frontal pose is reached to correct any
tracking errors. The stored bunch of jets may be
compared to the image region surrounding the fit
position (e.g., from the flat plane model), which works
in the same manner as tracking, except that instead of
comparing with the jet of the previous image frame, the
comparison is repeated with all jets of the bunch of
examples, and the most similar one is taken. Because
the facial features are known, e.g., the actual pose,
scale, and even the rough position, graph matching or
an exhaustive search in the image and/or pose space is
not needed and node tracking correction may be
performed in real time.
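
The correction step described above, comparing every jet of the stored bunch against the image region around the fit position, might look like the following sketch. The similarity measure (a normalized dot product of jet magnitudes), the `jet_at` callable, and the search radius are assumptions made for illustration only.

```python
import numpy as np

def jet_similarity(jet_a, jet_b):
    """Normalized dot product of jet magnitudes (assumed similarity measure)."""
    a, b = np.abs(jet_a), np.abs(jet_b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def correct_node(jet_at, bunch, fit_position, radius=2):
    """Search a small window around fit_position for the best bunch match.

    jet_at(x, y) is a hypothetical callable returning the image jet at a pixel.
    Returns (similarity, position, index of the winning bunch jet).
    """
    best = (-1.0, fit_position, None)
    fx, fy = fit_position
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            pos = (fx + dx, fy + dy)
            image_jet = jet_at(*pos)
            # Unlike tracking, compare against all jets of the bunch.
            for index, example in enumerate(bunch):
                score = jet_similarity(image_jet, example)
                if score > best[0]:
                    best = (score, pos, index)
    return best
```

Because the search is confined to a small window around an already known rough position, its cost is a few dozen jet comparisons per bunch entry rather than a scan of the whole image.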

For tracking correction, bunch graphs are not
needed for many different poses and scales because
rotation in the image plane as well as scale may be
taken into account by transforming either the local
image region or the jets of the bunch graph accordingly
as shown in FIG. 11. In addition to the frontal pose,
bunch graphs need to be created only for rotations in
depth.

The speed of the reinitialization process may be
increased by taking advantage of the fact that the
identity of the tracked person remains the same during
an image sequence. Accordingly, in an initial learning
session, a first sequence of the person may be taken
with the person exhibiting a full repertoire of frontal
facial expressions. This first sequence may be tracked
with high accuracy using the tracking and correction
scheme described above based on a large generalized
bunch graph that contains knowledge about many
different persons. This process may be performed
offline and generates a new personalized bunch graph.
The personalized bunch graph then may be used for
tracking this person at a fast rate in real time
because the personalized bunch graph is much smaller
than the larger, generalized bunch graph.

The speed of the reinitialization process also may
be increased by using a partial bunch graph
reinitialization. A partial bunch graph contains only a
subset of the nodes of a full bunch graph. The subset
may be as small as only a single node.

A pose estimation bunch graph makes use of a
family of two-dimensional bunch graphs defined in the
image plane. The different graphs within one family
account for different poses and/or scales of the head.
The landmark finding process attempts to match each
bunch graph from the family to the input image in order
to determine the pose or size of the head in the image.
An example of such a pose estimation procedure is shown
in FIG. 12. The first step of the pose estimation is
equivalent to that of the regular landmark finding.
The image (block 98) is transformed (blocks 100 and 102)
in order to use the graph similarity functions. Then,
instead of only one, a family of three bunch graphs is
used. The first bunch graph contains only the frontal
pose faces (equivalent to the frontal view described
above), and the other two bunch graphs contain quarter-
rotated faces (one representing rotations to the left
and one to the right). As before, the initial
position for each of the graphs is in the upper left
corner, and the positions of the graphs are scanned on
the image and the position and graph returning the
highest similarity after the landmark finding is
selected (blocks 104-114).

After initial matching for each graph, the
similarities of the final positions are compared (block
116). The graph that best corresponds to the pose
given on the image will have the highest similarity.
In FIG. 12, the left-rotated graph provides the best
fit to the image, as indicated by its similarity (block
118). Depending on resolution and degree of rotation
of the face in the picture, similarity of the correct
graph and graphs for other poses would vary, becoming
very close when the face is about half way between the
two poses for which the graphs have been defined. By
creating bunch graphs for more poses, a finer pose
estimation procedure may be implemented that would
discriminate between more degrees of head rotation and
handle rotations in other directions (e.g., up or down).

In order to robustly find a face at an arbitrary
distance from the camera, a similar approach may be
used in which two or three different bunch graphs each
having different scales may be used. The face in the
image will be assumed to have the same scale as the
bunch graph that returns the highest similarity to the
facial image.
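
The final selection step of the pose estimation procedure reduces to picking the graph with the highest similarity. The sketch below assumes the per-graph similarities have already been computed by landmark finding; the function names, the ambiguity margin, and the numeric values in the usage note are illustrative and not prescribed by the specification.

```python
def estimate_pose(similarities):
    """Return (pose, similarity) for the bunch graph with the highest similarity.

    similarities maps a pose name to the similarity of the best-matching
    graph position for that pose (hypothetical data layout).
    """
    pose = max(similarities, key=similarities.get)
    return pose, similarities[pose]

def is_ambiguous(similarities, margin=0.05):
    """True when the two best poses score within `margin` of each other.

    This mirrors the remark that similarities become very close when the
    face is about half way between two poses; the margin is an assumption.
    """
    ranked = sorted(similarities.values(), reverse=True)
    return len(ranked) > 1 and ranked[0] - ranked[1] < margin
```

For example, `estimate_pose({"frontal": 0.74, "left": 0.96, "right": 0.69})` selects the left-rotated graph. The same selection applies unchanged when the family of graphs differs in scale rather than in pose.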

A three dimensional (3D) landmark finding
technique related to the technique described above
also may use multiple bunch graphs adapted to different
poses. However, the 3D approach employs only one bunch
graph defined in 3D space. The geometry of the 3D
graph reflects an average face or head geometry. By
extracting jets from images of the faces of several
persons in different degrees of rotation, a 3D bunch
graph is created which is analogous to the 2D approach.
Each jet is now parameterized with the three rotation
angles. As in the 2D approach, the nodes are located
at the fiducial points of the head surface.
Projections of the 3D graph are then used in the
matching process. One important generalization of the
3D approach is that every node has the attached
parameterized family of bunch jets adapted to different
poses. The second generalization is that the graph may
undergo Euclidean transformations in 3D space and not
only transformations in the image plane.

The graph matching process may be formulated as a
coarse-to-fine approach that first utilizes graphs with
fewer nodes and kernels and then in subsequent steps
utilizes more dense graphs. The coarse-to-fine
approach is particularly suitable if high precision
localization of the feature points in certain areas of
the face is desired. Thus, computational effort is
saved by adopting a hierarchical approach in which
landmark finding is first performed on a coarser
resolution, and subsequently the adapted graphs are
checked at a higher resolution to analyze certain
regions in finer detail.
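
The saving from a coarse-to-fine search can be illustrated with a deliberately simplified stand-in: instead of matching a bunch graph, the sketch below localizes a plain image patch by sum of squared differences, first on a decimated image and then in a small window at full resolution. Patch matching, decimation without smoothing, and all names are assumptions chosen to keep the example short; they are not the graph matching of the specification.

```python
import numpy as np

def best_match(image, patch, candidates):
    """Return the (row, col) candidate with the smallest squared difference."""
    h, w = patch.shape
    best_pos, best_err = None, np.inf
    for r, c in candidates:
        if r < 0 or c < 0 or r + h > image.shape[0] or c + w > image.shape[1]:
            continue  # candidate window falls outside the image
        err = float(np.sum((image[r:r + h, c:c + w] - patch) ** 2))
        if err < best_err:
            best_pos, best_err = (r, c), err
    return best_pos

def coarse_to_fine(image, patch, factor=2):
    """Exhaustive search at coarse resolution, local refinement at full resolution."""
    coarse_img = image[::factor, ::factor]
    coarse_patch = patch[::factor, ::factor]
    rows = range(coarse_img.shape[0] - coarse_patch.shape[0] + 1)
    cols = range(coarse_img.shape[1] - coarse_patch.shape[1] + 1)
    r0, c0 = best_match(coarse_img, coarse_patch,
                        [(r, c) for r in rows for c in cols])
    # Refine only in a small neighborhood of the scaled-up coarse result.
    window = [(r0 * factor + dr, c0 * factor + dc)
              for dr in range(-factor, factor + 1)
              for dc in range(-factor, factor + 1)]
    return best_match(image, patch, window)
```

The exhaustive scan touches roughly 1/factor² as many positions, each with a patch 1/factor² the size, and the full-resolution pass examines only a handful of candidates.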

Further, the computational workload may be easily
split on a multi-processor machine such that once the
coarse regions are found, a few child processes start
working in parallel, each on its own part of the whole
image. At the end of the child processes, the
processes communicate the feature coordinates that they
located to the master process, which appropriately
scales and combines them to fit back into the original
image, thus considerably reducing the total computation
time.

As shown in FIG. 13, the facial features
corresponding to the nodes may be classified to account
for inappropriate tracking error indications such as,
for example, blinking or mouth opening. Labels are
attached to the different jets in the bunch graph
corresponding to the facial features, e.g., eye
open/closed, mouth open/closed, etc. The labels may be
copied along with the corresponding jet in the bunch
graph that is most similar to the current
image. The label tracking may be continuously
monitored, regardless of whether a tracking error is
detected. Accordingly, classification nodes may be
attached to the tracked nodes for the following:



- eye open/closed

- mouth open/closed

- tongue visible or not

- hair type classification

- wrinkle detection (e.g., on the forehead)



Thus, tracking allows utilization of two
information sources. One information source is based
on the feature locations, i.e., the node positions, and
the other information source is based on the feature
classes. The feature class information is more texture
based and, by comparing the local image region with a
set of stored examples, may function using lower
resolution and tracking accuracy than feature location
information that is based solely on the node positions.
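
Copying the label of the most similar stored jet, as described above, amounts to a nearest-example classification. The following sketch assumes each bunch entry is stored as a (label, jet) pair and again uses a normalized dot product of jet magnitudes as the similarity; both are illustrative assumptions.

```python
import numpy as np

def classify_feature(image_jet, labeled_bunch):
    """Return (label, similarity) of the stored jet closest to image_jet.

    labeled_bunch is a list of (label, jet) pairs, e.g. jets extracted for
    "eye open" and "eye closed" examples (hypothetical data layout).
    """
    def similarity(jet_a, jet_b):
        a, b = np.abs(jet_a), np.abs(jet_b)
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom > 0 else 0.0

    scores = [(similarity(image_jet, jet), label) for label, jet in labeled_bunch]
    best_score, best_label = max(scores)
    return best_label, best_score
```

Since only the label of the winning example is used, the classification does not depend on the precise node position, which is consistent with the remark that feature classes tolerate lower resolution and tracking accuracy.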

The facial sensing of the invention may be applied
to the creation and animation of static and dynamic
avatars as shown in FIG. 14. The avatar may be based
on a generic facial model or based on a person specific
facial model. The tracking and facial expression
recognition may be used for the incarnation of the
avatar with the person's features.

The generic facial model may be adapted to a
representative number of individuals and may be adapted
to perform realistic animation and rendering of a wide
range of facial features and/or expressions. The
generic model may be obtained by the following
techniques.

1. Mono-camera systems may be used (T. Akimoto et al.
1993) to produce a realistic avatar for use in low-end
tele-immersion systems. Face profile information of
individuals, as seen from sagittal and coronal planes,
may be merged to obtain the avatar.

2. Stereo-camera systems are able to perform accurate
3-D measurements when the cameras are fully calibrated
(camera parameters are computed through a calibration
process). An individual facial model may then be
obtained by fitting a generic facial model to the
obtained 3-D data. Because stereo algorithms do not
provide accurate information on non-textured areas,
projection of active-textured light may be used.

3. Feature-based stereo techniques where markers are
used on the individual face to compute precise 3-D
positions of the markers. The markers' 3-D
information is then used to fit a generic model.

4. 3-dimensional digitizers in which a sensor or
locating device is moved over each surface point to be
measured.

5. Active structured light where patterns are
projected and the resulting video stream is processed
to extract 3D measurements.

6. Laser-based surface scanning devices (such as those
developed by Cyberware, Inc) that provide accurate face
measurements.

7. A combination of the previous techniques.

These differing techniques are not of equal convenience
to the user. Some are able to obtain measurements on
the individual in a one-time process (the face being in
a desired posture for the duration of the measurement),
while others need a collection of samples and are more
cumbersome to use.

A generic three-dimensional head model for a
specific person can be generated using two facial
images showing a frontal and a profile view. Facial
sensing enables efficient and robust generation of
the 3-D head model.

Facial contour extraction is performed together
with the localization of the person's eyes, nose, mouth
and chin. This feature location information may be
obtained by using the elastic bunch graph
technique in combination with hierarchical matching to
automatically extract facial features as shown in FIG.
14. The feature location information may then be
combined (see T. Akimoto and Y. Suenaga. Automatic
Creation of 3D Facial Models. In IEEE Computer
Graphics & Applications, pages 16-22. September 1993.)
to produce a 3D model of the person's head. A generic
three dimensional head model is adapted so that its
proportions are related to the image's measurements.
Finally, both side and front images may be combined to
obtain a better texture model for the avatar, i.e., the
front view is used to texture map the front of the
model and the side view is used for the side of the
model. Facial sensing improves this technique because
extracted features may be labeled (known points may be
defined in the profile) so that the two images need not
be taken simultaneously.
An avatar image may be animated by the following
common techniques (see F. I. Parke and K. Waters.
Computer Facial Animation. A K Peters, Ltd. Wellesley,
Massachusetts, 1996).

1. Key framing and geometric interpolation, where a
number of key poses and expressions are defined.
Geometric interpolation is then used between the key
frames to provide animation. Such a system is
frequently referred to as a performance-based (or
performance-driven) model.

2. Direct parameterization which directly maps
expressions and pose to a set of parameters that are
then used to drive the model.

3. Pseudo-muscle models which simulate muscle actions
using geometric deformations.

4. Muscle-based models where muscles and skin are
modeled using physical models.

5. 2-D and 3-D morphing which use 2D morphing between
images in a video stream to produce 2D animation. A set
of landmarks is identified and used to warp between
two images of a sequence. Such a technique can be
extended to 3D (see F. F. Pighin, J. Hecker, D.
Lischinski, R. Szeliski, and D. H. Salesin.
Synthesizing Realistic Facial Expressions from
Photographs. In SIGGRAPH 98 Conference Proceedings,
pages 75-84. July 1998.).

6. Other approaches such as control points and finite
element models.

For these techniques, facial sensing enhances the
animation process by providing automatic extraction and
characterization of facial features. Extracted
features may be used to interpolate expressions in the
case of key framing and interpolation models, or to
select parameters for direct parameterized models or
pseudo-muscle or muscle models. In the case of 2-D
and 3-D morphing, facial sensing may be used to
automatically select features on a face, providing the
appropriate information to perform the geometric
transformation.
An example of an avatar animation that uses
facial feature tracking and classification may be shown
with respect to FIG. 15. During the training phase the
individual is prompted for a series of predetermined
facial expressions (block 120), and sensing is used to
track the features (block 122). At predetermined
locations, jets and image patches are extracted for the
various expressions (block 124). Image patches
surrounding facial features are collected along with
the jets 126 extracted from these features. These jets
are used later to classify or tag facial features 128.
This is done by using these jets to generate a
personalized bunch graph and by applying the
classification method described above.

As shown in FIG. 16, for animation of an avatar,
the system transmits all image patches 128, as well as
the image of the whole face 130 (the "face frame")
minus the parts shown in the image patches, to a remote
site (blocks 132 & 134). The software for the
animation engine also may need to be transmitted. The
sensing system then observes the user's face and facial
sensing is applied to determine which of the image
patches is most similar to the current facial
expression (blocks 136 & 138). The image tags are
transmitted to the remote site (block 140), allowing the
animation engine to assemble the face 142 using the
correct image patches.

To fit the image patches smoothly into the image
frame, Gaussian blurring may be employed. For
realistic rendering, local image morphing may be needed
because the animation may not be continuous in the
sense that a succession of images may be presented as
imposed by the sensing. The morphing may be realized
using linear interpolation of corresponding points on
the image space. To create intermediate images, linear
interpolation is applied using the following equations:

    P_i = (2 - i) P_1 + (i - 1) P_2        (7)

    I_i = (2 - i) I_1 + (i - 1) I_2        (8)

where P_1 and P_2 are corresponding points in the images
I_1 and I_2, and I_i is the i-th interpolated image, with
1 ≤ i ≤ 2. Note that for processing efficiency, the image
interpolation may be implemented using a pre-computed
hash table for P_i and I_i. The number and accuracy of
the points used, and the accuracy of the interpolated
facial model, generally determine the resulting image
quality.
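
Equations (7) and (8) weight the first image and its points by (2 - i) and the second by (i - 1), so that i = 1 reproduces the first image and i = 2 the second. A minimal sketch follows; the function name and the NumPy array representation are assumptions, and the pre-computed table mentioned above is not modeled.

```python
import numpy as np

def interpolate(p1, p2, img1, img2, i):
    """Linear interpolation of corresponding points and images for 1 <= i <= 2.

    Implements P_i = (2 - i) P_1 + (i - 1) P_2 and
               I_i = (2 - i) I_1 + (i - 1) I_2.
    """
    if not 1.0 <= i <= 2.0:
        raise ValueError("i must lie in the interval [1, 2]")
    w1, w2 = 2.0 - i, i - 1.0  # the two weights always sum to 1
    points = w1 * np.asarray(p1, dtype=float) + w2 * np.asarray(p2, dtype=float)
    image = w1 * np.asarray(img1, dtype=float) + w2 * np.asarray(img2, dtype=float)
    return points, image
```

Stepping i through a few values between 1 and 2 yields the intermediate frames that smooth the transition between two transmitted image patches.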

Thus, the reconstructed face in the remote display
may be composed by assembling pieces of images
corresponding to the detected expressions in the
learning step. Accordingly, the avatar exhibits
features corresponding to the person commanding the
animation. Thus, at initialization, a set of cropped
images corresponding to each tracked facial feature is
provided, together with a "face container" as the
resulting image of the face after each feature is
removed. The animation is started and facial sensing
is used to generate specific tags which are transmitted
as described previously. Decoding occurs by selecting
image pieces associated with the transmitted tag, e.g.,
the image of the mouth labeled with a tag
"smiling-mouth" 146 (FIG. 16).
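
The decoding step at the remote site can be sketched as pasting the patch selected by each received tag into the face container. The dictionary layout, the placement table, and the function name below are hypothetical; blurring and morphing of the patch borders are omitted.

```python
import numpy as np

def assemble_face(face_container, patches, placements, tags):
    """Paste the patch chosen by each tag into a copy of the face container.

    patches:    {(feature, tag): image array}, sent once at initialization
    placements: {feature: (top, left)} position of each feature in the frame
    tags:       {feature: tag} received for the current frame
    (All three layouts are illustrative assumptions.)
    """
    face = face_container.copy()
    for feature, tag in tags.items():
        patch = patches[(feature, tag)]
        top, left = placements[feature]
        height, width = patch.shape[:2]
        face[top:top + height, left:left + width] = patch
    return face
```

After initialization only the short tags need to be transmitted per frame, for example `{"mouth": "smiling-mouth"}`, rather than image data.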

A more advanced level of avatar animation may be
reached when the aforementioned dynamic texture
generation is integrated with more conventional
techniques of volume morphing (as shown in FIG. 17). To
achieve volume morphing, the location of the node
positions may be used to drive control points on a mesh
150. Next, the textures 152 dynamically generated
using tags are mapped onto the mesh to generate a
realistic head image 154. An alternative to using the
sensed node positions as drivers of control points on
the mesh is to use the tags to select local morph
targets. A morph target is a local mesh configuration
that has been determined for the different facial
expressions and gestures for which sample jets have
been collected. These local mesh geometries can be
determined by stereo vision techniques. The use of
morph targets is further developed in the following
references (see J. R. Kent, W. B. Carlson,
and R. E. Parent. Shape Transformation for Polyhedral
Objects. In SIGGRAPH 92 Conference Proceedings, volume
26, pages 47-54. August 1992; Pighin et al. 1998,
supra).
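
Selecting local morph targets by tag can be pictured as overwriting a subset of mesh vertices with a stored local configuration. The sketch below is an assumption-laden illustration: the storage format of a morph target (vertex indices plus positions) and the function name are invented for the example, and no blending between targets is attempted.

```python
import numpy as np

def apply_morph_targets(neutral_vertices, morph_targets, tags):
    """Return a mesh in which each tagged local region takes its target shape.

    neutral_vertices: (N, 3) array of the mesh in its neutral configuration
    morph_targets:    {tag: (vertex_indices, target_positions)}
    tags:             iterable of tags produced by facial sensing
    (Hypothetical data layout, not taken from the specification.)
    """
    mesh = np.array(neutral_vertices, dtype=float)  # copy; input is left intact
    for tag in tags:
        indices, positions = morph_targets[tag]
        mesh[np.asarray(indices)] = np.asarray(positions, dtype=float)
    return mesh
```

The dynamically selected textures would then be mapped onto the resulting mesh before rendering.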

A useful extension to the vision-based avatar
animation is to integrate the facial sensing with
speech analysis in order to synthesize the correct lip
motion as shown in FIG. 18. The lip synching technique
is particularly useful to map lip motions resulting
from speech onto an avatar. It is also helpful as a
back-up in case the vision-based lip tracking fails.

Although the foregoing discloses the preferred
embodiments of the present invention, it is understood
that those skilled in the art may make various changes
to the preferred embodiments without departing from the
scope of the invention. The invention is defined only
by the following claims.

What is claimed is:

1. A method for feature sensing on a sequence of
image frames, comprising:

a step for transforming each image frame using a
wavelet transformation to generate a transformed image
frame;

a step for initializing node locations associated
with wavelets jets of a model graph to the transformed
image frame by moving the model graph across the
transformed image frame and placing the model graph at
a location in the transformed image frame of maximum
jet similarity between the wavelet jets at the node
locations and the transformed image frame;

a step for tracking the location of one or more
node locations of the model graph between image frames;
and
a step for reinitializing a tracked node if the
node's position deviates beyond a predetermined
position constraint between image frames.

2. A method for feature sensing as defined in
claim 1, wherein the model graph used in the
initializing step is based on a predetermined pose.

3. A method for feature sensing as defined in
claim 1, wherein the tracking step tracks the node
locations using elastic bunch graph matching.

4. A method for feature sensing as defined in
claim 1, wherein the tracking step uses linear position
prediction to predict node locations in a subsequent
image frame and the reinitializing step reinitializes a
node location based on a deviation from the predicted
node location that is greater than a predetermined
threshold deviation.

5. A method for feature sensing as defined in
claim 1, wherein the predetermined position constraint
is based on a geometrical position constraint
associated with relative positions between the node
locations.

6. A method for feature sensing as defined in
claim 1, wherein the node locations are transmitted to
a remote site for animating an avatar image.

7. A method for feature sensing as defined in
claim 1, wherein the tracking step includes determining
a facial characteristic.

8. A method for feature sensing as defined in
claim 7, further comprising transmitting the node
locations and facial characteristics to a remote site
for animating an avatar image having facial
characteristics which are based upon the facial
characteristics determined by the tracking step.

9. A method for feature sensing as defined in
claim 7, wherein the facial characteristic determined
by the tracking step is whether mouth is open or
closed.

10. A method for feature sensing as defined in
claim 7, wherein the facial characteristic determined
by the tracking step is whether eyes are open or
closed.

11. A method for feature sensing as defined in
claim 7, wherein the facial characteristic determined
by the tracking step is whether tongue is visible in
the mouth.

12. A method for feature sensing as defined in
claim 7, wherein the facial characteristic determined
by the tracking step is based on facial wrinkles
detected in the image.

13. A method for feature sensing as defined in
claim 7, wherein the facial characteristic determined
by the tracking step is based on hair type.

14. A method for feature sensing as defined in
claim 7, wherein each facial characteristic is
associated by training with an image tag that
identifies an image segment of the image frame that is
associated with the facial characteristic.

15. A method for feature sensing as defined in
claim 14, wherein the image segments identified by the
associated image tag are morphed into an avatar image.

16. A method for feature sensing as defined in
claim 14, wherein the node locations and feature tags
are used for volume morphing the corresponding image
segments into a three-dimensional image.

17. A method for feature sensing as defined in
claim 7, wherein the model graph comprises 18 location
nodes associated with distinguishing features on a
human face.

18. A method for feature sensing as defined in
claim 17, wherein the 18 node locations of the face are
associated with, respectively,
a right eye pupil;
a left eye pupil;
a top of a nose;
a right corner of a right eyebrow;
a left corner of the right eyebrow;
a left corner of a left eyebrow;
a right nostril;
a tip of the nose;
a left nostril;
a right corner of a mouth;
a center of an upper lip;
a left corner of the mouth;
a center of a lower lip;
a bottom of a right ear;
a top of the right ear;
a top of a left ear; and
a bottom of the left ear.

19. A method for facial feature sensing as
defined in claim 1, wherein the node locations tracking
step includes lip synching based on audio signals
associated with movement of the node locations of a
mouth generating the audio signals.

20. A method for individualizing a head model
based on facial feature finding, wherein facial feature
finding is based on elastic bunch graph matching.

21. A method for individualizing a head model as
defined in claim 20, wherein the matching is performed
using a coarse-to-fine approach.

22. Apparatus for feature sensing on a sequence
of image frames, comprising:

means for transforming each image frame using a
wavelet transformation to generate a transformed image
frame;

means for initializing node locations associated
with wavelets jets of a model graph to the transformed
image frame by moving the model graph across the
transformed image frame and placing the model graph at
a location in the transformed image frame of maximum
jet similarity between the wavelet jets at the node
locations and the transformed image frame;

means for tracking the location of one or more
node locations of the model graph between image frames;
and

means for reinitializing a tracked node if the
node's position deviates beyond a predetermined
position constraint between image frames.

23. Apparatus for feature sensing as defined in
claim 22, further comprising:

means for determining a facial characteristic; and

means for animating an avatar image having facial
characteristics which are based upon the facial
characteristics generated by the determining means.

24. Apparatus for feature sensing as defined in
claim 23, wherein the model graph comprises 18 location
nodes associated with distinguishing features on a
human face.

25. A model graph for facial image analysis
comprising 18 location nodes associated with
distinguishing features on a human face.

26. A model graph for facial image analysis as
defined in claim 25, wherein the 18 node locations of
the face are associated with, respectively,
a right eye pupil;
a left eye pupil;
a top of a nose;
a right corner of a right eyebrow;
a left corner of the right eyebrow;
a left corner of a left eyebrow;
a right nostril;
a tip of the nose;
a left nostril;
a right corner of a mouth;
a center of an upper lip;
a left corner of the mouth;
a center of a lower lip;
a bottom of a right ear;
a top of the right ear;
a top of a left ear; and
a bottom of the left ear.



[Drawing sheets 1/16 to 3/16 of WO 99/53443 (PCT/US99/07933), each marked SUBSTITUTE SHEET (RULE 26); no figure text recoverable.]




[FIG. 4 (sheet 4/16): flowchart with the blocks "Transformation into Gabor space"; "Initial placement of the bunch graph on the transformed image"; "Global move of the graph across the whole image"; "Graph position with the maximum similarity value"; "Additional scaling of the graph in the found position and localization of the best matching points". Reference numerals 46, 48, 50 and 52 appear in the figure.]



[FIG. 5 (sheet 5/16): panels labeled "Gabor wavelets", "imaginary part", "convolution result" and "magnitude".]

[FIG. 7 (sheet 6/16); no figure text recoverable.]

[FIG. 8 (sheet 7/16); no figure text recoverable.]


[Sheet 9/16: plots with axes labeled "Angle in radians"; tick labels not recoverable.]



[FIG. 12 (sheet 10/16): flowchart of the pose estimation procedure. Recoverable labels: "Original Image" (98); "Transformation in the Gabor space" (100, 102); "Initial graph placement" and "Final graph placement"; "Frontal bunch graph matching", "Left Rotation bunch graph matching" and "Right Rotation bunch graph matching" (reference numerals 110 and 112 appear in the figure); frontal graph similarity Sf = 0.74, left rotation graph similarity Sl = 0.96, right rotation graph similarity Sr = 0.69; "Similarity Comparison"; "Output of the pose with the highest S" (118).]



[FIG. 13 (sheet 11/16); no figure text recoverable.]

[Sheets 13/16 and 14/16; no figure text recoverable.]

[Sheet 15/16: block labels "Morph 3D Mesh", "Generate Texture Map" and "Render Head".]



INTERNATIONAL SEARCH REPORT

International Application No: PCT/US 99/07933

IPC 6: G06T, G06K

C. DOCUMENTS CONSIDERED TO BE RELEVANT

Citation of document, with indication, where appropriate, of the
relevant passages:

MAURER T ET AL: "Tracking and learning
graphs and pose on image sequences of
PROCEEDINGS OF THE SECOND INTERNATIONAL
CONFERENCE ON AUTOMATIC FACE AND GESTURE
RECOGNITION (CAT. NO. 96TB100079),
PROCEEDINGS OF THE SECOND INTERNATIONAL
CONFERENCE ON AUTOMATIC FACE AND GESTURE
RECOGNITION, KILLINGTON, VT, USA, 14-16
OCT 1996, pages 176-181, XP0021156U9
1996, Los Alamitos, CA, USA, IEEE Comput.
Soc. Press, USA, ISBN: 0-8186-7713-9
page 176, right-hand column, paragraph 2
- page 178, paragraph 4

Relevant to claim No.: 1, 6, 20, 22, 25

Further documents are listed in the continuation of box C.
Patent family members are listed in annex.

17 September 1999

05/10/1999

Name and mailing address of the ISA:
European Patent Office, P.B. 5818 Patentlaan 2
NL - 2280 HV Rijswijk
Tel. (+31-70) 340-2040, Tx. 31 651 epo nl,
Fax: (+31-70) 340-3016

Authorized officer: Chateau, J-P

Form PCT/ISA/210 (second sheet) (July 1992)






